Example Notebook (Risk Factors for Cervical Cancer)

In [1]:
import os
import xai
import logging as log 
import warnings
import matplotlib.pyplot as plt
import sys, os
from util.commons import *
from util.ui import *
from util.model import *
from util.split import *
from util.dataset import *
from IPython.display import display, HTML

Load a dataset

In this notebook a dataset named 'Risk Factors for Cervical Cancer'. The dataset was collected at 'Hospital Universitario de Caracas' in Caracas, Venezuela. The dataset comprises demographic information, habits, and historic medical records of 858 patients. Several patients decided not to answer some of the questions because of privacy concerns (missing values).

In [2]:
dataset, msg = get_dataset('cervical_cancer')
display(msg)
display(dataset.df)
"Dataset 'cervical_cancer (Risk Factors for Cervical Cancer)' loaded successfully. For further information about this dataset please visit: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29#"
Age Number of sexual partners First sexual intercourse Num of pregnancies Smokes Smokes (years) Smokes (packs/year) Hormonal Contraceptives Hormonal Contraceptives (years) IUD IUD (years) STDs STDs (number) STDs:condylomatosis STDs:cervical condylomatosis STDs:vaginal condylomatosis STDs:vulvo-perineal condylomatosis STDs:syphilis STDs:pelvic inflammatory disease STDs:genital herpes STDs:molluscum contagiosum STDs:AIDS STDs:HIV STDs:Hepatitis B STDs:HPV STDs: Number of diagnosis STDs: Time since first diagnosis STDs: Time since last diagnosis Dx:Cancer Dx:CIN Dx:HPV Dx Hinselmann Schiller Citology Biopsy
0 18 4.0 15.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
1 15 1.0 14.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
2 34 1.0 ? 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
3 52 5.0 16.0 4.0 1.0 37.0 37.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 1 0 1 0 0 0 0 0
4 46 3.0 21.0 4.0 0.0 0.0 0.0 1.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
853 34 3.0 18.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
854 32 2.0 19.0 1.0 0.0 0.0 0.0 1.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
855 25 2.0 17.0 0.0 0.0 0.0 0.0 1.0 0.08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 1 0
856 33 2.0 24.0 2.0 0.0 0.0 0.0 1.0 0.08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
857 29 2.0 20.0 1.0 0.0 0.0 0.0 1.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0

858 rows × 36 columns

Preprocess the dataset

The dataset will be used same as described here: https://christophm.github.io/interpretable-ml-book/cervical.html All unknown values (\?) are going to be set to 0.0.

In [3]:
df = dataset.df.drop(columns=['Smokes (packs/year)', 'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:genital herpes',
                              'STDs:Hepatitis B', 'STDs:vulvo-perineal condylomatosis', 'Dx:HPV',
                              'STDs:molluscum contagiosum', 'STDs:syphilis', 'STDs:AIDS', 'Hinselmann',
                              'STDs:pelvic inflammatory disease', 'STDs:HPV', 'Dx:CIN', 'Dx', 'STDs:HIV',
                              'Schiller', 'STDs:vaginal condylomatosis', 'Dx:Cancer', 'Citology'], axis=1)

num_cols = ['Number of sexual partners', 'First sexual intercourse', 'Num of pregnancies', 'Smokes',
            'Smokes (years)', 'Hormonal Contraceptives', 'Hormonal Contraceptives (years)', 'IUD', 
            'IUD (years)', 'STDs', 'STDs (number)', 'STDs: Time since first diagnosis',
            'STDs: Time since last diagnosis']

df = normalize_undefined_values('?', df)

str_limit = 5
for col in df.columns:
    if col in num_cols and len(df[col].unique()) > str_limit:
        df[col] = df[col].astype('float')
    elif col in num_cols and len(df[col].unique()) <= str_limit:
        df[col] = df[col].astype(str)
        
df
Out[3]:
Age Number of sexual partners First sexual intercourse Num of pregnancies Smokes Smokes (years) Hormonal Contraceptives Hormonal Contraceptives (years) IUD IUD (years) STDs STDs (number) STDs: Number of diagnosis STDs: Time since first diagnosis STDs: Time since last diagnosis Biopsy
0 18 4.0 15.0 1.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0
1 15 1.0 14.0 1.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0
2 34 1.0 0.0 1.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0
3 52 5.0 16.0 4.0 1.0 37.0 1.0 3.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0
4 46 3.0 21.0 4.0 0.0 0.0 1.0 15.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
853 34 3.0 18.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0
854 32 2.0 19.0 1.0 0.0 0.0 1.0 8.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0
855 25 2.0 17.0 0.0 0.0 0.0 1.0 0.08 0.0 0.0 0.0 0.0 0 0.0 0.0 0
856 33 2.0 24.0 2.0 0.0 0.0 1.0 0.08 0.0 0.0 0.0 0.0 0 0.0 0.0 0
857 29 2.0 20.0 1.0 0.0 0.0 1.0 0.50 0.0 0.0 0.0 0.0 0 0.0 0.0 0

858 rows × 16 columns

Visualize the dataset

Three visualization functions offered by the XAI module will be used for analyzing the dataset.

  • In the first plot below, for example, we can see that the majority of samples (people) are not suffering from cervical cancer.
  • The second and third plots show correlations between the features. The second one plots the correlations as a matrix, whereas the third one as a hierarchical dendogram.
In [4]:
%matplotlib inline
plt.style.use('ggplot')
warnings.filterwarnings('ignore')

imbalanced_cols = ['Biopsy']

xai.imbalance_plot(df, *imbalanced_cols, categorical_cols=['Biopsy'])
xai.correlations(df, include_categorical=True, plot_type="matrix")
xai.correlations(df, include_categorical=True)
Out[4]:

Target

In the cell below the target variable is selected. The Biopsy serves as the gold standard for diagnosing cervical cancer, therefore we will use it as target.

In [5]:
df_X, df_y, msg = split_feature_target(df, "Biopsy")
df_y
18-Nov-20 21:35:29 - Target 'Biopsy' selected successfully.
Out[5]:
0      0
1      0
2      0
3      0
4      0
      ..
853    0
854    0
855    0
856    0
857    0
Name: Biopsy, Length: 858, dtype: int64

Training the models

In this step three models are going to be trained on this dataset. In the output below we can see classification reports for the trained models.

  • Model 1: Logistic Regression
  • Model 2: Random Forest
  • Model 3: Decision Tree
In [6]:
# Create three empty models
initial_models, msg = fill_empty_models(df_X, df_y, 3)
models = []

# Train model 1
model1 = initial_models[0]
msg = fill_model(model1, Algorithm.LOGISTIC_REGRESSION, Split(SplitTypes.IMBALANCED, None))
models.append(model1)

# Train model 2
model2 = initial_models[1]
msg = fill_model(model2, Algorithm.RANDOM_FOREST, Split(SplitTypes.IMBALANCED, None))
models.append(model2)

# Train model 3
model3 = initial_models[2]
msg = fill_model(model3, Algorithm.DECISION_TREE, Split(SplitTypes.IMBALANCED, None))
models.append(model3)
18-Nov-20 21:35:29 - Model accuracy: 0.6705426356589147
18-Nov-20 21:35:29 - Classification report: 
              precision    recall  f1-score   support

           0       0.94      0.69      0.80       241
           1       0.09      0.41      0.14        17

    accuracy                           0.67       258
   macro avg       0.51      0.55      0.47       258
weighted avg       0.89      0.67      0.75       258

18-Nov-20 21:35:29 - Model Model 1 trained successfully!
18-Nov-20 21:35:30 - Model accuracy: 0.9341085271317829
18-Nov-20 21:35:30 - Classification report: 
              precision    recall  f1-score   support

           0       0.94      1.00      0.97       241
           1       0.50      0.06      0.11        17

    accuracy                           0.93       258
   macro avg       0.72      0.53      0.54       258
weighted avg       0.91      0.93      0.91       258

18-Nov-20 21:35:30 - Model Model 2 trained successfully!
18-Nov-20 21:35:30 - Model accuracy: 0.8643410852713178
18-Nov-20 21:35:30 - Classification report: 
              precision    recall  f1-score   support

           0       0.93      0.93      0.93       241
           1       0.00      0.00      0.00        17

    accuracy                           0.86       258
   macro avg       0.46      0.46      0.46       258
weighted avg       0.87      0.86      0.87       258

18-Nov-20 21:35:30 - Model Model 3 trained successfully!
In [7]:
model_1 = models[0]
model_2 = models[1]
model_3 = models[2]

Global model interpretations

In the following steps we will use global interpretation techniques that help us to answer questions like how does a model behave in general? What features drive predictions and what features are completely useless. This data may be very important in understanding the model better. Most of the techniques work by investigating the conditional interactions between the target variable and the features on the complete dataset.

Feature importance

The importance of a feature is the increase in the prediction error of the model after we permuted the feature’s values, which breaks the relationship between the feature and the true outcome. A feature is “important” if permuting it increases the model error. This is because in that case, the model relied heavily on this feature for making right prediction. On the other hand, a feature is “unimportant” if permuting it doesn’t affect the error by much or doesn’t change it at all.

ELI5

In the first case, we use ELI5, which does not permute the features but only visualizes the weight of each feature.

  • Model 1 (Logistic Regression)
In [8]:
plot = generate_feature_importance_plot(FeatureImportanceType.ELI5, model_1)
display(plot)
18-Nov-20 21:35:30 - Generating a feature importance plot using ELI5 for Model 1 ...

y=1 top features

Weight? Feature
+0.650 STDs (number)
+0.375 STDs: Time since last diagnosis
+0.242 Smokes_0.0
+0.050 STDs_1.0
+0.045 Smokes (years)
+0.041 Hormonal Contraceptives (years)
+0.034 Age
+0.032 IUD (years)
+0.022 First sexual intercourse
+0.011 Hormonal Contraceptives_1.0
-0.027 Number of sexual partners
-0.035 IUD_1.0
-0.178 Num of pregnancies
-0.325 STDs: Time since first diagnosis
-0.390 IUD_0.0
-0.394 STDs: Number of diagnosis
-0.425 <BIAS>
-0.436 Hormonal Contraceptives_0.0
-0.475 STDs_0.0
-0.667 Smokes_1.0
  • Model 2 (Random Forest)
In [9]:
plot = generate_feature_importance_plot(FeatureImportanceType.ELI5, model_2)
display(plot)
18-Nov-20 21:35:30 - Generating a feature importance plot using ELI5 for Model 2 ...
Weight Feature
0.2019 ± 0.1232 Age
0.1773 ± 0.1075 First sexual intercourse
0.1570 ± 0.1202 Hormonal Contraceptives (years)
0.1451 ± 0.1038 Num of pregnancies
0.1168 ± 0.0892 Number of sexual partners
0.0241 ± 0.0419 Smokes (years)
0.0218 ± 0.0524 IUD (years)
0.0190 ± 0.0452 Hormonal Contraceptives_1.0
0.0161 ± 0.0491 STDs: Time since first diagnosis
0.0147 ± 0.0334 Hormonal Contraceptives_0.0
0.0143 ± 0.0429 STDs (number)
0.0143 ± 0.0404 STDs: Time since last diagnosis
0.0131 ± 0.0467 STDs_0.0
0.0125 ± 0.0392 IUD_1.0
0.0120 ± 0.0331 Smokes_0.0
0.0111 ± 0.0309 Smokes_1.0
0.0106 ± 0.0440 STDs: Number of diagnosis
0.0094 ± 0.0320 IUD_0.0
0.0090 ± 0.0314 STDs_1.0
  • Model 3 (Decision Tree)
In [10]:
plot = generate_feature_importance_plot(FeatureImportanceType.ELI5, model_3)
display(plot)
18-Nov-20 21:35:31 - Generating a feature importance plot using ELI5 for Model 3 ...
Weight Feature
0.2670 Age
0.1861 First sexual intercourse
0.1689 Hormonal Contraceptives (years)
0.1312 Num of pregnancies
0.0965 Number of sexual partners
0.0561 Smokes (years)
0.0381 STDs_1.0
0.0194 Hormonal Contraceptives_0.0
0.0139 IUD (years)
0.0081 Smokes_0.0
0.0063 STDs: Time since last diagnosis
0.0051 Hormonal Contraceptives_1.0
0.0034 STDs: Time since first diagnosis
0 STDs_0.0
0 STDs: Number of diagnosis
0 STDs (number)
0 Smokes_1.0
0 IUD_0.0
0 IUD_1.0
In [11]:
print(generate_feature_importance_explanation(FeatureImportanceType.ELI5, models, 4))
18-Nov-20 21:35:31 - Generating feature importance explanation for ELI5 ...
Summary:
 The highest feature for Model 1 is STDs (number) with weight ~0.65.
 The 2nd best feature for Model 1 is STDs: Time since last diagnosis with weight ~0.375.
 The 3rd most influential feature for Model 1 is Smokes_0.0 with weight ~0.242.
 The 4th best feature for Model 1 is STDs_1.0 with weight ~0.05.
 
 The most important feature for Model 2 is Age with weight ~0.202.
 The 2nd most valuable feature for Model 2 is First sexual intercourse with weight ~0.177.
 The 3rd most valuable feature for Model 2 is Hormonal Contraceptives (years) with weight ~0.157.
 The 4th highest feature for Model 2 is Num of pregnancies with weight ~0.145.
 
 The best feature for Model 3 is Age with weight ~0.267, alike 1st for Model 2.
 The 2nd most valuable feature for Model 3 is First sexual intercourse with weight ~0.186, matching 2nd for Model 2.
 The 3rd most influential feature for Model 3 is Hormonal Contraceptives (years) with weight ~0.169, similar to 3rd for Model 2.
 The 4th most important feature for Model 3 is Num of pregnancies with weight ~0.131, similar to 4th for Model 2.
 

Skater

In this step we use the Skater module, which permutes the features to generate a feature importance plot.

  • Model 1 (Logistic Regression)
In [12]:
%matplotlib inline
plt.rcParams['figure.figsize'] = [14, 15]
plt.style.use('ggplot')
warnings.filterwarnings('ignore')

_ = generate_feature_importance_plot(FeatureImportanceType.SKATER, model_1)
18-Nov-20 21:35:31 - Generating a feature importance plot using SKATER for Model 1 ...
18-Nov-20 21:35:31 - Initializing Skater - generating new in-memory model. This operation may be time-consuming so please be patient.
2020-11-18 21:35:31,837 - skater.core.explanations - WARNING - Progress bars slow down runs by 10-20%. For slightly 
faster runs, do progress_bar=False
[19/19] features ████████████████████ Time elapsed: 1 seconds
  • Model 2 (Random Forest)
In [13]:
_ = generate_feature_importance_plot(FeatureImportanceType.SKATER, model_2)
18-Nov-20 21:35:34 - Generating a feature importance plot using SKATER for Model 2 ...
18-Nov-20 21:35:34 - Initializing Skater - generating new in-memory model. This operation may be time-consuming so please be patient.
2020-11-18 21:35:34,771 - skater.core.explanations - WARNING - Progress bars slow down runs by 10-20%. For slightly 
faster runs, do progress_bar=False
[19/19] features ████████████████████ Time elapsed: 3 seconds
  • Model 3 (Decision Tree)
In [14]:
_ = generate_feature_importance_plot(FeatureImportanceType.SKATER, model_3)
18-Nov-20 21:35:38 - Generating a feature importance plot using SKATER for Model 3 ...
18-Nov-20 21:35:38 - Initializing Skater - generating new in-memory model. This operation may be time-consuming so please be patient.
2020-11-18 21:35:38,581 - skater.core.explanations - WARNING - Progress bars slow down runs by 10-20%. For slightly 
faster runs, do progress_bar=False
[19/19] features ████████████████████ Time elapsed: 1 seconds
In [15]:
print('\n' + generate_feature_importance_explanation(FeatureImportanceType.SKATER, models, 4))
18-Nov-20 21:35:41 - Generating feature importance explanation for SKATER ...
2020-11-18 21:35:41,086 - skater.core.explanations - WARNING - Progress bars slow down runs by 10-20%. For slightly 
faster runs, do progress_bar=False
[19/19] features ████████████████████ Time elapsed: 2 seconds
2020-11-18 21:35:43,462 - skater.core.explanations - WARNING - Progress bars slow down runs by 10-20%. For slightly 
faster runs, do progress_bar=False
[19/19] features ████████████████████ Time elapsed: 3 seconds
2020-11-18 21:35:46,780 - skater.core.explanations - WARNING - Progress bars slow down runs by 10-20%. For slightly 
faster runs, do progress_bar=False
[19/19] features ████████████████████ Time elapsed: 1 seconds
Summary:
 The best feature for Model 1 is Age with weight ~0.146.
 The 2nd most valuable feature for Model 1 is Num of pregnancies with weight ~0.129.
 The 3rd best feature for Model 1 is Hormonal Contraceptives_0.0 with weight ~0.1.
 The 4th highest feature for Model 1 is STDs: Time since last diagnosis with weight ~0.089.
 
 The most important feature for Model 2 is First sexual intercourse with weight ~0.157.
 The 2nd best feature for Model 2 is Num of pregnancies with weight ~0.156, same as 2nd for Model 1.
 The 3rd best feature for Model 2 is Age with weight ~0.148, same as 1st for Model 1.
 The 4th most valuable feature for Model 2 is Hormonal Contraceptives (years) with weight ~0.14.
 
 The highest feature for Model 3 is Age with weight ~0.267, matching 1st for Model 1.
 The 2nd highest feature for Model 3 is First sexual intercourse with weight ~0.173, identical to 1st for Model 2.
 The 3rd most important feature for Model 3 is Hormonal Contraceptives (years) with weight ~0.165, identical to 4th for Model 2.
 The 4th best feature for Model 3 is Num of pregnancies with weight ~0.111, same as 2nd for Model 1.
 

Shap

In the cell below we use the SHAP (SHapley Additive exPlanations). It uses a combination of feature contributions and game theory to come up with SHAP values. Then, it computes the global feature importance by taking the average of the SHAP value magnitudes across the dataset.

  • Model 1 (Logistic Regression)
In [16]:
from shap import initjs
initjs()

%matplotlib inline
plt.style.use('ggplot')
warnings.filterwarnings('ignore')

generate_feature_importance_plot(FeatureImportanceType.SHAP, model_1)
18-Nov-20 21:35:48 - Generating a feature importance plot using SHAP for Model 1 ...
18-Nov-20 21:35:48 - Initializing Shap - calculating shap values. This operation is time-consuming so please be patient.

  • Model 2 (Random Forest)
In [17]:
generate_feature_importance_plot(FeatureImportanceType.SHAP, model_2)
18-Nov-20 21:36:16 - Generating a feature importance plot using SHAP for Model 2 ...
18-Nov-20 21:36:16 - Initializing Shap - calculating shap values. This operation is time-consuming so please be patient.

  • Model 3 (Decision Tree)
In [18]:
generate_feature_importance_plot(FeatureImportanceType.SHAP, model_3)
18-Nov-20 21:37:06 - Generating a feature importance plot using SHAP for Model 3 ...
18-Nov-20 21:37:06 - Initializing Shap - calculating shap values. This operation is time-consuming so please be patient.

In [19]:
print(generate_feature_importance_explanation(FeatureImportanceType.SHAP, models, 4))
18-Nov-20 21:37:20 - Generating feature importance explanation for SHAP ...
Summary:
 The best feature for Model 1 is STDs: Time since last diagnosis with weight ~0.223.
 The 2nd most influential feature for Model 1 is STDs: Time since first diagnosis with weight ~0.183.
 The 3rd highest feature for Model 1 is Age with weight ~0.105.
 The 4th highest feature for Model 1 is Hormonal Contraceptives_0.0 with weight ~0.099.
 
 The most important feature for Model 2 is First sexual intercourse with weight ~0.046.
 The 2nd most influential feature for Model 2 is Hormonal Contraceptives (years) with weight ~0.043.
 The 3rd most influential feature for Model 2 is Age with weight ~0.042, alike 3rd for Model 1.
 The 4th most important feature for Model 2 is Number of sexual partners with weight ~0.039.
 
 The most valuable feature for Model 3 is Age with weight ~0.132, alike 3rd for Model 1.
 The 2nd highest feature for Model 3 is Hormonal Contraceptives (years) with weight ~0.113, alike 2nd for Model 2.
 The 3rd best feature for Model 3 is IUD (years) with weight ~0.096.
 The 4th most valuable feature for Model 3 is First sexual intercourse with weight ~0.077, matching 1st for Model 2.
 

Partial Dependence Plots

The partial dependence plot (short PDP or PD plot) shows the marginal effect one or two features have on the predicted outcome of a machine learning model. A partial dependence plot can show whether the relationship between the target and a feature is linear, monotonic or more complex. For example, when applied to a linear regression model, partial dependence plots always show a linear relationship.

PDPBox

PDPBox is the first module that we use for ploting partial dependence.

  • Model 1 (Logistic Regression)
In [20]:
generate_pdp_plots(PDPType.PDPBox, model_1, "Age", "None")
generate_pdp_plots(PDPType.PDPBox, model_1, "Age", "Number of sexual partners")
18-Nov-20 21:37:20 - Generating a PDP plot using PDPBox for Model 1 ...
18-Nov-20 21:37:20 - Generating a PDP plot using PDPBox for Model 1 ...
18-Nov-20 21:37:21 - findfont: Font family ['Arial'] not found. Falling back to DejaVu Sans.
18-Nov-20 21:37:21 - findfont: Font family ['Arial'] not found. Falling back to DejaVu Sans.
18-Nov-20 21:37:21 - findfont: Font family ['Arial'] not found. Falling back to DejaVu Sans.
18-Nov-20 21:37:22 - findfont: Font family ['Arial'] not found. Falling back to DejaVu Sans.
  • Model 2 (Random Forest)
In [21]:
generate_pdp_plots(PDPType.PDPBox, model_2, "Age", "None")
generate_pdp_plots(PDPType.PDPBox, model_2, "Age", "Number of sexual partners")
18-Nov-20 21:37:24 - Generating a PDP plot using PDPBox for Model 2 ...
18-Nov-20 21:37:25 - Generating a PDP plot using PDPBox for Model 2 ...
  • Model 3 (Decision Tree)
In [22]:
generate_pdp_plots(PDPType.PDPBox, model_3, "Age", "None")
generate_pdp_plots(PDPType.PDPBox, model_3, "Age", "Number of sexual partners")
18-Nov-20 21:37:32 - Generating a PDP plot using PDPBox for Model 3 ...
18-Nov-20 21:37:32 - Generating a PDP plot using PDPBox for Model 3 ...

Skater

  • Model 1 (Logistic Regression)
In [23]:
generate_pdp_plots(PDPType.SKATER, model_1, "Age", "Number of sexual partners")
18-Nov-20 21:37:36 - Generating a PDP plot using SKATER for Model 1 ...
2020-11-18 21:37:36,737 - skater.core.explanations - WARNING - Progress bars slow down runs by 10-20%. For slightly 
faster runs, do progressbar=False
[559/559] grid cells ████████████████████ Time elapsed: 81 seconds
  • Model 2 (Random Forest)
In [24]:
generate_pdp_plots(PDPType.SKATER, model_2, "Age", "Number of sexual partners")
18-Nov-20 21:38:59 - Generating a PDP plot using SKATER for Model 2 ...
2020-11-18 21:38:59,705 - skater.core.explanations - WARNING - Progress bars slow down runs by 10-20%. For slightly 
faster runs, do progressbar=False
[492/492] grid cells ████████████████████ Time elapsed: 98 seconds
  • Model 3 (Decision Tree)
In [25]:
generate_pdp_plots(PDPType.SKATER, model_3, "Age", "Number of sexual partners")
18-Nov-20 21:40:39 - Generating a PDP plot using SKATER for Model 3 ...
2020-11-18 21:40:40,239 - skater.core.explanations - WARNING - Progress bars slow down runs by 10-20%. For slightly 
faster runs, do progressbar=False
[572/572] grid cells ████████████████████ Time elapsed: 64 seconds

SHAP

  • Model 1 (Logistic Regression)
In [26]:
generate_pdp_plots(PDPType.SHAP, model_1, "Age", "Number of sexual partners")
18-Nov-20 21:41:45 - Generating a PDP plot using SHAP for Model 1 ...
  • Model 2 (Random Forest)
In [27]:
generate_pdp_plots(PDPType.SHAP, model_2, "Age", "Number of sexual partners")
18-Nov-20 21:41:46 - Generating a PDP plot using SHAP for Model 2 ...
  • Model 3 (Decision Tree)
In [28]:
generate_pdp_plots(PDPType.SHAP, model_3, "Age", "Number of sexual partners")
18-Nov-20 21:41:46 - Generating a PDP plot using SHAP for Model 3 ...

Local model interpretations

Local interpretation focuses on specifics of each individual and provides explanations that can lead to a better understanding of the feature contribution in smaller groups of individuals that are often overlooked by the global interpretation techniques. We will use two moduels for interpreting single instances - SHAP and LIME.

SHAP

SHAP leverages the idea of Shapley values for model feature influence scoring. The technical definition of a Shapley value is the “average marginal contribution of a feature value over all possible coalitions.” In other words, Shapley values consider all possible predictions for an instance using all possible combinations of inputs. Because of this exhaustive approach, SHAP can guarantee properties like consistency and local accuracy. LIME, on the other hand, does not offer such guarantees.

LIME

LIME (Local Interpretable Model-agnostic Explanations) builds sparse linear models around each prediction to explain how the black box model works in that local vicinity. While treating the model as a black box, we perturb the instance we want to explain and learn a sparse linear model around it, as an explanation. LIME has the advantage over SHAP, that it is a lot faster.

In [29]:
examples = [] + get_test_examples(model_1, ExampleType.FALSELY_CLASSIFIED, 2)
examples = examples + get_test_examples(model_2, ExampleType.TRULY_CLASSIFIED, 2)
examples
Out[29]:
[41, 129, 72, 143]
Example 1
In [30]:
print(get_example_information(model_1, examples[0]))
print(generate_single_instance_comparison(models, examples[0]))
Example 41's data: 
Age                                  29
Number of sexual partners             2
First sexual intercourse             18
Num of pregnancies                    4
Smokes                              0.0
Smokes (years)                        0
Hormonal Contraceptives             0.0
Hormonal Contraceptives (years)       0
IUD                                 0.0
IUD (years)                           0
STDs                                0.0
STDs (number)                         0
STDs: Number of diagnosis             0
STDs: Time since first diagnosis      0
STDs: Time since last diagnosis       0
Name: 335, dtype: object
Actual result for example 41: 1

Example 41 was truly classified by Model 2, Model 3 and falsely classified by Model 1.
 For further clarification see the explanations below.

  • Model 1 (Logistic Regression)
In [31]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_1, examples[0])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_1, examples[0]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_1, examples[0])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_1, examples[0]))
display(explanation)
18-Nov-20 21:41:47 - Initializing LIME - generating new explainer. This operation may be time-consuming so please be patient.
The prediction probability of Model 1's decision for this example is 0.7. LIME's explanation: 
The feature that mainly affects Model 1's positive (1) prediction probability is STDs: Time since first diagnosis <= 0.00 with value of 0.3617.
The feature with the second most considerable change on Model 1's positive (1) prediction probability is Smokes=0.0 with value of 0.1711.
The third most effective feature for the positive (1) prediction probability of Model 1 is STDs: Number of diagnosis <= 0.00 with value of 0.0821
The feature that mostly influences Model 1's negative (0) prediction probability is STDs: Time since last diagnosis <= 0.00 with value of -0.4341.
The feature with the second largest affect on Model 1's negative (0) prediction probability is STDs (number) <= 0.00 with value of -0.1963.


The prediction probability of Model 1's decision for this example is 0.7. SHAP's explanation: 
The feature that mostly influences Model 1's positive (1) prediction probability is Hormonal Contraceptives_0.0 with value of 0.1023.
The feature with the second most substantial change on Model 1's positive (1) prediction probability is STDs: Time since last diagnosis with value of 0.0879.
The third most important feature for the positive (1) prediction probability of Model 1 is Num of pregnancies with value of 0.0833
The feature that largely impacts Model 1's negative (0) prediction probability is STDs: Time since first diagnosis with value of -0.0745.
The feature with the second most considerable impact on Model 1's negative (0) prediction probability is Age with value of -0.0157.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
  • Model 2 (Random Forest)
In [32]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_2, examples[0])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_2, examples[0]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_2, examples[0])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_2, examples[0]))
display(explanation)
18-Nov-20 21:41:49 - Initializing LIME - generating new explainer. This operation may be time-consuming so please be patient.
The prediction probability of Model 2's decision for this example is 0.77. LIME's explanation: 
The feature that mostly affects Model 2's positive (1) prediction probability is Smokes=0.0 with value of 0.0228.
The feature with the second biggest change on Model 2's positive (1) prediction probability is Number of sexual partners > 3.00 with value of 0.0206.
The third most effective feature for the positive (1) prediction probability of Model 2 is Smokes (years) <= 0.00 with value of 0.0095
The feature that largely changes Model 2's negative (0) prediction probability is STDs=0.0 with value of -0.0779.
The feature with the second most substantial change on Model 2's negative (0) prediction probability is STDs (number) <= 0.00 with value of -0.0317.


The prediction probability of Model 2's decision for this example is 0.77. SHAP's explanation: 
The feature that mostly affects Model 2's positive (1) prediction probability is First sexual intercourse with value of 0.0366.
The feature with the second most substantial impact on Model 2's positive (1) prediction probability is Hormonal Contraceptives_1.0 with value of 0.012.
The third most impactful feature for the positive (1) prediction probability of Model 2 is Hormonal Contraceptives_0.0 with value of 0.0027
The feature that mostly influences Model 2's negative (0) prediction probability is Num of pregnancies with value of -0.0856.
The feature with the second biggest change on Model 2's negative (0) prediction probability is Number of sexual partners with value of -0.0694.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
  • Model 3 (Decision Tree)
In [33]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_3, examples[0])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_3, examples[0]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_3, examples[0])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_3, examples[0]))
display(explanation)
18-Nov-20 21:41:52 - Initializing LIME - generating new explainer. This operation may be time-consuming so please be patient.
The prediction probability of Model 3's decision for this example is 1.0. LIME's explanation: 
The feature that mostly influences Model 3's positive (1) prediction probability is 1.00 < Num of pregnancies <= 2.00 with value of 0.0353.
The feature with the second most substantial change on Model 3's positive (1) prediction probability is Hormonal Contraceptives=1.0 with value of 0.0248.
The third most effective feature for the positive (1) prediction probability of Model 3 is Smokes=0.0 with value of 0.0184
The feature that mainly influences Model 3's negative (0) prediction probability is STDs=0.0 with value of -0.1634.
The feature with the second most substantial change on Model 3's negative (0) prediction probability is First sexual intercourse > 18.00 with value of -0.0533.


The prediction probability of Model 3's decision for this example is 1.0. SHAP's explanation: 


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
Example 2
In [34]:
print(get_example_information(model_1, examples[1]))
print(generate_single_instance_comparison(models, examples[1]))
Example 129's data: 
Age                                  30
Number of sexual partners             3
First sexual intercourse             19
Num of pregnancies                    2
Smokes                              0.0
Smokes (years)                        0
Hormonal Contraceptives             1.0
Hormonal Contraceptives (years)       5
IUD                                 1.0
IUD (years)                           5
STDs                                1.0
STDs (number)                         1
STDs: Number of diagnosis             1
STDs: Time since first diagnosis      3
STDs: Time since last diagnosis       3
Name: 165, dtype: object
Actual result for example 129: 0

Example 129 was truly classified by Model 2, Model 3 and falsely classified by Model 1.
 For further clarification see the explanations below.

  • Model 1 (Logistic Regression)
In [35]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_1, examples[1])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_1, examples[1]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_1, examples[1])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_1, examples[1]))
display(explanation)
The prediction probability of Model 1's decision for this example is 0.83. LIME's explanation: 
The feature that mostly affects Model 1's positive (1) prediction probability is STDs: Time since last diagnosis > 0.00 with value of 0.4327.
The feature with the second most substantial influence on Model 1's positive (1) prediction probability is STDs (number) > 0.00 with value of 0.196.
The third most impactful feature for the positive (1) prediction probability of Model 1 is Smokes=0.0 with value of 0.1603
The feature that primarily influences Model 1's negative (0) prediction probability is STDs: Time since first diagnosis > 0.00 with value of -0.3806.
The feature with the second most considerable influence on Model 1's negative (0) prediction probability is Smokes (years) <= 0.00 with value of -0.0883.


The prediction probability of Model 1's decision for this example is 0.83. SHAP's explanation: 
The feature that largely influences Model 1's positive (1) prediction probability is STDs: Time since last diagnosis with value of 0.1518.
The feature with the second largest change on Model 1's positive (1) prediction probability is STDs (number) with value of 0.1308.
The third most effective feature for the positive (1) prediction probability of Model 1 is STDs_0.0 with value of 0.0945
The feature that mainly impacts Model 1's negative (0) prediction probability is STDs: Time since first diagnosis with value of -0.1248.
The feature with the second largest impact on Model 1's negative (0) prediction probability is STDs: Number of diagnosis with value of -0.0763.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
  • Model 2 (Random Forest)
In [36]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_2, examples[1])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_2, examples[1]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_2, examples[1])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_2, examples[1]))
display(explanation)
The prediction probability of Model 2's decision for this example is 1.0. LIME's explanation: 
The feature that largely influences Model 2's positive (1) prediction probability is Smokes=0.0 with value of 0.0177.
The feature with the second largest influence on Model 2's positive (1) prediction probability is Hormonal Contraceptives=1.0 with value of 0.0147.
The third most important feature for the positive (1) prediction probability of Model 2 is 0.00 < Hormonal Contraceptives (years) <= 0.12 with value of 0.0129
The feature that mainly influences Model 2's negative (0) prediction probability is STDs=0.0 with value of -0.0734.
The feature with the second biggest change on Model 2's negative (0) prediction probability is STDs (number) <= 0.00 with value of -0.0347.


The prediction probability of Model 2's decision for this example is 1.0. SHAP's explanation: 
The feature that mainly affects Model 2's positive (1) prediction probability is First sexual intercourse with value of 0.0541.
The feature with the second largest change on Model 2's positive (1) prediction probability is Age with value of 0.0386.
The third most important feature for the positive (1) prediction probability of Model 2 is STDs: Time since last diagnosis with value of 0.0083
The feature that mainly impacts Model 2's negative (0) prediction probability is Number of sexual partners with value of -0.0201.
The feature with the second most considerable change on Model 2's negative (0) prediction probability is Num of pregnancies with value of -0.0159.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
  • Model 3 (Decision Tree)
In [37]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_3, examples[1])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_3, examples[1]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_3, examples[1])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_3, examples[1]))
display(explanation)
The prediction probability of Model 3's decision for this example is 1.0. LIME's explanation: 
The feature that mostly changes Model 3's positive (1) prediction probability is IUD (years) <= 0.00 with value of 0.0245.
The feature with the second most substantial change on Model 3's positive (1) prediction probability is Smokes=0.0 with value of 0.0212.
The third most influential feature for the positive (1) prediction probability of Model 3 is Hormonal Contraceptives=1.0 with value of 0.0194
The feature that largely changes Model 3's negative (0) prediction probability is STDs=0.0 with value of -0.1704.
The feature with the second most substantial influence on Model 3's negative (0) prediction probability is First sexual intercourse > 18.00 with value of -0.0488.


The prediction probability of Model 3's decision for this example is 1.0. SHAP's explanation: 


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
Example 3
In [38]:
print(get_example_information(model_1, examples[2]))
print(generate_single_instance_comparison(models, examples[2]))
Example 72's data: 
Age                                  34
Number of sexual partners             1
First sexual intercourse              0
Num of pregnancies                    1
Smokes                              0.0
Smokes (years)                        0
Hormonal Contraceptives             0.0
Hormonal Contraceptives (years)       0
IUD                                 0.0
IUD (years)                           0
STDs                                0.0
STDs (number)                         0
STDs: Number of diagnosis             0
STDs: Time since first diagnosis      0
STDs: Time since last diagnosis       0
Name: 2, dtype: object
Actual result for example 72: 0

Example 72 was truly classified by Model 1, Model 2, Model 3 and falsely classified by no model.
 For further clarification see the explanations below.

  • Model 1 (Logistic Regression)
In [39]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_1, examples[2])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_1, examples[2]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_1, examples[2])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_1, examples[2]))
display(explanation)
The prediction probability of Model 1's decision for this example is 0.63. LIME's explanation: 
The feature that mainly affects Model 1's positive (1) prediction probability is STDs: Time since first diagnosis <= 0.00 with value of 0.3564.
The feature with the second most considerable change on Model 1's positive (1) prediction probability is Smokes=0.0 with value of 0.1782.
The third most important feature for the positive (1) prediction probability of Model 1 is Age > 33.00 with value of 0.117
The feature that mainly changes Model 1's negative (0) prediction probability is STDs: Time since last diagnosis <= 0.00 with value of -0.4346.
The feature with the second most substantial impact on Model 1's negative (0) prediction probability is STDs (number) <= 0.00 with value of -0.2077.


The prediction probability of Model 1's decision for this example is 0.63. SHAP's explanation: 
The feature that mostly changes Model 1's positive (1) prediction probability is Hormonal Contraceptives_0.0 with value of 0.105.
The feature with the second most considerable influence on Model 1's positive (1) prediction probability is First sexual intercourse with value of 0.091.
The third most effective feature for the positive (1) prediction probability of Model 1 is STDs: Time since last diagnosis with value of 0.0902
The feature that largely impacts Model 1's negative (0) prediction probability is STDs: Time since first diagnosis with value of -0.077.
The feature with the second biggest change on Model 1's negative (0) prediction probability is Age with value of -0.0565.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
  • Model 2 (Random Forest)
In [40]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_2, examples[2])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_2, examples[2]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_2, examples[2])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_2, examples[2]))
display(explanation)
The prediction probability of Model 2's decision for this example is 0.94. LIME's explanation: 
The feature that primarily changes Model 2's positive (1) prediction probability is Smokes=0.0 with value of 0.0221.
The feature with the second largest affect on Model 2's positive (1) prediction probability is Hormonal Contraceptives (years) > 2.00 with value of 0.0173.
The third most influential feature for the positive (1) prediction probability of Model 2 is Hormonal Contraceptives=1.0 with value of 0.0142
The feature that primarily influences Model 2's negative (0) prediction probability is STDs=0.0 with value of -0.0803.
The feature with the second biggest influence on Model 2's negative (0) prediction probability is STDs (number) <= 0.00 with value of -0.037.


The prediction probability of Model 2's decision for this example is 0.94. SHAP's explanation: 
The feature that mainly changes Model 2's positive (1) prediction probability is First sexual intercourse with value of 0.0902.
The feature with the second biggest change on Model 2's positive (1) prediction probability is STDs: Time since last diagnosis with value of 0.0198.
The third most influential feature for the positive (1) prediction probability of Model 2 is Num of pregnancies with value of 0.0082
The feature that mostly changes Model 2's negative (0) prediction probability is Number of sexual partners with value of -0.0533.
The feature with the second most substantial change on Model 2's negative (0) prediction probability is Age with value of -0.033.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
  • Model 3 (Decision Tree)
In [41]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_3, examples[2])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_3, examples[2]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_3, examples[2])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_3, examples[2]))
display(explanation)
The prediction probability of Model 3's decision for this example is 1.0. LIME's explanation: 
The feature that largely affects Model 3's positive (1) prediction probability is Number of sexual partners <= 1.00 with value of 0.0447.
The feature with the second most substantial influence on Model 3's positive (1) prediction probability is STDs: Time since last diagnosis <= 0.00 with value of 0.0203.
The third most important feature for the positive (1) prediction probability of Model 3 is Hormonal Contraceptives (years) <= 0.00 with value of 0.0176
The feature that mostly impacts Model 3's negative (0) prediction probability is STDs=0.0 with value of -0.1702.
The feature with the second largest affect on Model 3's negative (0) prediction probability is Smokes (years) <= 0.00 with value of -0.0448.


The prediction probability of Model 3's decision for this example is 1.0. SHAP's explanation: 
The feature that primarily impacts Model 3's positive (1) prediction probability is First sexual intercourse with value of 0.3167.
The feature with the second largest impact on Model 3's positive (1) prediction probability is Hormonal Contraceptives_0.0 with value of 0.0167.
The feature that primarily impacts Model 3's negative (0) prediction probability is IUD (years) with value of -0.15.
The feature with the second most considerable influence on Model 3's negative (0) prediction probability is Hormonal Contraceptives (years) with value of -0.15.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
Example 4
In [42]:
print(get_example_information(model_1, examples[1]))
print(generate_single_instance_comparison(models, examples[1]))
Example 129's data: 
Age                                  30
Number of sexual partners             3
First sexual intercourse             19
Num of pregnancies                    2
Smokes                              0.0
Smokes (years)                        0
Hormonal Contraceptives             1.0
Hormonal Contraceptives (years)       5
IUD                                 1.0
IUD (years)                           5
STDs                                1.0
STDs (number)                         1
STDs: Number of diagnosis             1
STDs: Time since first diagnosis      3
STDs: Time since last diagnosis       3
Name: 165, dtype: object
Actual result for example 129: 0

Example 129 was truly classified by Model 2, Model 3 and falsely classified by Model 1.
 For further clarification see the explanations below.

  • Model 1 (Logistic Regression)
In [43]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_1, examples[3])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_1, examples[3]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_1, examples[3])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_1, examples[3]))
display(explanation)
The prediction probability of Model 1's decision for this example is 0.53. LIME's explanation: 
The feature that mostly changes Model 1's positive (1) prediction probability is STDs: Time since first diagnosis <= 0.00 with value of 0.3647.
The feature with the second most considerable impact on Model 1's positive (1) prediction probability is Smokes=0.0 with value of 0.1675.
The third most influential feature for the positive (1) prediction probability of Model 1 is Age > 33.00 with value of 0.1212
The feature that largely affects Model 1's negative (0) prediction probability is STDs: Time since last diagnosis <= 0.00 with value of -0.4466.
The feature with the second biggest affect on Model 1's negative (0) prediction probability is STDs (number) <= 0.00 with value of -0.1956.


The prediction probability of Model 1's decision for this example is 0.53. SHAP's explanation: 
The feature that primarily changes Model 1's positive (1) prediction probability is IUD_0.0 with value of 0.0943.
The feature with the second biggest affect on Model 1's positive (1) prediction probability is Age with value of 0.0823.
The third most impactful feature for the positive (1) prediction probability of Model 1 is STDs: Time since first diagnosis with value of 0.0787
The feature that largely affects Model 1's negative (0) prediction probability is Num of pregnancies with value of -0.1287.
The feature with the second biggest change on Model 1's negative (0) prediction probability is STDs: Time since last diagnosis with value of -0.0905.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
  • Model 2 (Random Forest)
In [44]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_2, examples[3])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_2, examples[3]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_2, examples[3])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_2, examples[3]))
display(explanation)
The prediction probability of Model 2's decision for this example is 0.99. LIME's explanation: 
The feature that mainly influences Model 2's positive (1) prediction probability is Number of sexual partners > 3.00 with value of 0.0234.
The feature with the second most substantial impact on Model 2's positive (1) prediction probability is 26.00 < Age <= 33.00 with value of 0.0033.
The feature that mostly affects Model 2's negative (0) prediction probability is STDs=0.0 with value of -0.0768.
The feature with the second biggest impact on Model 2's negative (0) prediction probability is STDs (number) <= 0.00 with value of -0.0361.


The prediction probability of Model 2's decision for this example is 0.99. SHAP's explanation: 
The feature that mainly impacts Model 2's positive (1) prediction probability is Smokes_1.0 with value of 0.0214.
The feature with the second biggest affect on Model 2's positive (1) prediction probability is Hormonal Contraceptives_0.0 with value of 0.0132.
The third most effective feature for the positive (1) prediction probability of Model 2 is First sexual intercourse with value of 0.0121
The feature that mostly influences Model 2's negative (0) prediction probability is Number of sexual partners with value of -0.0128.
The feature with the second largest impact on Model 2's negative (0) prediction probability is Age with value of -0.0106.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
  • Model 3 (Decision Tree)
In [45]:
explanation = explain_single_instance(LocalInterpreterType.LIME, model_3, examples[3])
print(generate_single_instance_explanation(LocalInterpreterType.LIME, model_3, examples[3]))
explanation.show_in_notebook(show_table=True, show_all=True)
explanation = explain_single_instance(LocalInterpreterType.SHAP, model_3, examples[3])
print(generate_single_instance_explanation(LocalInterpreterType.SHAP, model_3, examples[3]))
display(explanation)
The prediction probability of Model 3's decision for this example is 1.0. LIME's explanation: 
The feature that mostly impacts Model 3's positive (1) prediction probability is 1.00 < Num of pregnancies <= 2.00 with value of 0.0278.
The feature with the second most considerable impact on Model 3's positive (1) prediction probability is Smokes=0.0 with value of 0.0254.
The third most important feature for the positive (1) prediction probability of Model 3 is Hormonal Contraceptives (years) <= 0.00 with value of 0.0231
The feature that mainly influences Model 3's negative (0) prediction probability is STDs=0.0 with value of -0.1511.
The feature with the second most considerable impact on Model 3's negative (0) prediction probability is Smokes (years) <= 0.00 with value of -0.0243.


The prediction probability of Model 3's decision for this example is 1.0. SHAP's explanation: 
The feature that largely impacts Model 3's positive (1) prediction probability is Age with value of 0.0333.
The feature with the second largest influence on Model 3's positive (1) prediction probability is Hormonal Contraceptives_0.0 with value of 0.0333.
The third most influential feature for the positive (1) prediction probability of Model 3 is First sexual intercourse with value of 0.0333
The feature that mainly impacts Model 3's negative (0) prediction probability is IUD (years) with value of -0.05.
The feature with the second most substantial affect on Model 3's negative (0) prediction probability is Hormonal Contraceptives (years) with value of -0.05.


Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.